Data Cleaning Protocol (revised)

This document derives and presents a multi-step method for converting the raw personal environmental data into an analysis-ready dataset. To develop it, I selected the data of 5 participants: ACT001D (very good data on visual inspection), ACT004S and ACT014F (some poor data), and ACT003C and ACT032V (very poor data). The variables to be cleaned are temperature, relative humidity (RH), and noise.

Main reasons for cleaning are:

After revision, the cleaning steps are the following:

  1. Study design limitations (extracted from the personal visit log, PVL)
  2. Physically possible (temperature above absolute zero, noise ≥ 0 dB, RH between 0 % and 100 %, etc.)
  3. Physically plausible (e.g. temperatures below -10 °C are implausible); especially important for filtering taped temperature
  4. Variability (variable-specific threshold on the standard deviation over an extended period of time); especially important for worn temperature and humidity

1. Study Design: Observation period and excluding PVL-visits

First, data have to be excluded that were recorded outside the observation window, or during personal visit log (PVL) times if the devices were changed. The data were already cut to the observation window during data compilation, but the check for device changes is done here. No additional values were excluded for the 5 chosen individuals.
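
The device-change check described above could look roughly like the following. This is a minimal sketch in base R, not the code used for the study: the structure of the visit log and the names `TIME`, `VISIT_START`, `VISIT_END`, `DEVICE_CHANGED`, and `IBH_TEMP_00` are assumptions made for illustration only.

```r
# Hypothetical example data: six 10-minute house measurements for one participant.
data_H <- data.frame(
  TIME = as.POSIXct("2023-01-01 10:00:00", tz = "UTC") + 600 * 0:5,
  IBH_TEMP = c(27.1, 27.3, 22.0, 21.8, 27.2, 27.4)
)

# Hypothetical visit log: one visit during which the device was changed.
pvl <- data.frame(
  VISIT_START = as.POSIXct("2023-01-01 10:15:00", tz = "UTC"),
  VISIT_END = as.POSIXct("2023-01-01 10:35:00", tz = "UTC"),
  DEVICE_CHANGED = TRUE
)

# Keep only the visits during which a device was changed.
changes <- pvl[pvl$DEVICE_CHANGED, ]

# TRUE if a time stamp falls inside any of the device-change visits.
inside_visit <- function(t) {
  any(t >= changes$VISIT_START & t <= changes$VISIT_END)
}

in_visit <- vapply(
  seq_len(nrow(data_H)),
  function(i) inside_visit(data_H$TIME[i]),
  logical(1)
)

# Flag column (assumed name): 1 = recorded during a device change, 0 = keep.
data_H$IBH_TEMP_00 <- as.numeric(in_visit)
```

Writing the result as a 0/1 flag rather than deleting rows follows the convention of the later steps, so all flags can be combined at the end.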


2. Physically possible

Every variable (temperature, RH, noise) has physical limits. Values in the following ranges are flagged as impossible:

  1. Temperature: < -273 °C
  2. RH: < 0 % or > 100 %
  3. Noise: < 0 dB

# House
data_H <- data_H |>
  mutate(IBH_TEMP_01 = if_else(IBH_TEMP < -273, 1, 0),
         IBH_HUM_01 = if_else(IBH_HUM < 0 | IBH_HUM > 100, 1, 0))

# Worn
data_W <- data_W |>
  mutate(IBW_TEMP_01 = if_else(IBW_TEMP < -273, 1, 0),
         IBW_HUM_01 = if_else(IBW_HUM < 0 | IBW_HUM > 100, 1, 0))

# Taped
data_T <- data_T |>
  mutate(IBT_TEMP_01 = if_else(IBT_TEMP < -273, 1, 0))

# Noise
data_N <- data_N |>
  mutate(NS_01 = if_else(NS < 0, 1, 0))

No plots are shown here because there are no impossible values in the example data.
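
Since every range check in this and the next step has the same shape, it can be expressed once as a small helper. The sketch below is in base R; `flag_outside()` and the example values are hypothetical and not part of the original pipeline.

```r
# Hypothetical helper: 1 if x lies outside [lower, upper], 0 if inside,
# NA if the measurement itself is missing.
flag_outside <- function(x, lower = -Inf, upper = Inf) {
  as.numeric(x < lower | x > upper)
}

# Made-up humidity values, including two impossible ones and a missing one.
hum <- c(35.2, 101.4, -0.5, NA, 60.0)
hum_flag <- flag_outside(hum, lower = 0, upper = 100)

# Number of impossible values; a count of 0 would justify omitting the plots.
n_impossible <- sum(hum_flag, na.rm = TRUE)
```

Counting the flags per device is a quick way to confirm that no impossible values are present before skipping the plots.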


3. Physically plausible

The plausible range is to some degree subjective: it depends on the observation surroundings and varies not only with the variable but also with what the variable describes (e.g. taped versus house temperature). From this step on, the value ranges are therefore device-specific. Values in the following ranges are flagged as implausible:

  1. House: Temperature < 0 °C or > 55 °C, RH: no additional filtering
  2. Worn: Temperature < 10 °C or > 45 °C, RH: no additional filtering
  3. Taped: Temperature below the 10th percentile (no upper filtering because taped temperature is almost always greater than house temperature; the 25th percentile proved too high as a cut-off)
  4. Noise: no additional filtering

# House
data_H <- data_H |>
  mutate(IBH_TEMP_02 = if_else(IBH_TEMP < 0 | IBH_TEMP > 55, 1, 0))

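# Worn
# NOTE: the list above gives 10 °C as the lower bound for worn temperature,
# while this code uses 15 °C.
data_W <- data_W |>
  mutate(IBW_TEMP_02 = if_else(IBW_TEMP < 15 | IBW_TEMP > 45, 1, 0))

# Taped

# 10th percentile of taped temperature, used as the lower cut-off
Q10 <- quantile(data_T$IBT_TEMP, .10, na.rm = TRUE)
data_T <- data_T |>
  mutate(IBT_TEMP_02 = if_else(IBT_TEMP < Q10, 1, 0))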
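
To illustrate why the choice of percentile matters for the taped temperature, the following sketch compares a 10th and a 25th percentile cut-off on made-up values. The numbers are invented for illustration and do not come from the study data.

```r
# Made-up taped temperatures: mostly skin-like values plus a few low readings
# such as might occur when the sensor is detached.
ibt_temp <- c(24.1, 30.2, 33.9, 34.2, 34.4, 34.6, 34.7, 34.8, 35.0, 35.1)

# Candidate lower cut-offs.
q10 <- quantile(ibt_temp, 0.10, na.rm = TRUE)
q25 <- quantile(ibt_temp, 0.25, na.rm = TRUE)

# 1 = flagged as implausibly low, 0 = kept.
flag_q10 <- as.numeric(ibt_temp < q10)
flag_q25 <- as.numeric(ibt_temp < q25)

# How many values each cut-off would remove.
n_flagged <- c(q10 = sum(flag_q10), q25 = sum(flag_q25))
```

In this toy example the 25th percentile also removes a value close to the bulk of the distribution, which is the kind of over-filtering that a lower percentile avoids.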


4. Variability

The variability differs significantly between variables and devices (e.g. house versus worn humidity). Because we are interested in the stress experienced by the individuals, it is important not to filter out extreme but realistic conditions, as these represent the largest stress impact. However, we do want to filter out worn measurements whose variance resembles that of the house measurements, which indicates that the device was not worn. We use the moving standard deviation of 3 left-aligned humidity values. As an additional safeguard against filtering out reasonable values, we filter a measurement only if the standard deviation has been too low for 2 consecutive measurements. We initially considered 4 values; in the averaging process, however, that setting introduced “wrong” data back into the hourly averages.

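  1. Worn: threshold for humidity sd: x

# Threshold for the moving standard deviation of worn humidity
x <- 1

# Worn
# rollmean() comes from the zoo package. IBW_HUM_MSD is the moving standard
# deviation of 3 left-aligned humidity values described above.
data_W <- data_W |>
  mutate(# 1 where the moving standard deviation is below the threshold
         IBW_TEMP_04_intermediate = if_else(IBW_HUM_MSD < x, 1, 0),
         IBW_HUM_04_intermediate = if_else(IBW_HUM_MSD < x, 1, 0),
         # Mean over 2 consecutive values: equals 1 only if both are below the threshold
         IBW_TEMP_04 = rollmean(IBW_TEMP_04_intermediate, k = 2, fill = NA, align = "left"),
         IBW_HUM_04 = rollmean(IBW_HUM_04_intermediate, k = 2, fill = NA, align = "left"))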
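
The computation of `IBW_HUM_MSD` is not shown in this document. The following base R sketch shows one way the whole variability rule could work, under the assumption that the moving standard deviation uses a left-aligned window of 3 values. The helper `roll_left()`, the example humidity series, and the final reduction to a binary flag are illustrative assumptions.

```r
# Left-aligned rolling function: element i uses observations i, ..., i + k - 1.
# Positions without a full window are NA.
roll_left <- function(x, k, fun) {
  n <- length(x)
  out <- rep(NA_real_, n)
  if (n >= k) {
    for (i in seq_len(n - k + 1)) {
      out[i] <- fun(x[i:(i + k - 1)])
    }
  }
  out
}

# Made-up worn humidity: variable while worn, then nearly constant (not worn).
hum <- c(40, 52, 45, 58, 41.0, 41.1, 41.0, 41.2, 41.1, 55)

x <- 1                                    # threshold for the humidity sd
msd <- roll_left(hum, 3, sd)              # moving standard deviation
low <- as.numeric(msd < x)                # 1 = variability too low
both_low <- roll_left(low, 2, mean)       # 1 only if 2 consecutive values are low
hum_flag <- as.numeric(both_low == 1)     # binary flag (assumed final form)
```

Note that the rolling mean over two 0/1 values can also be 0.5 when only one of the two is low, so the sketch compares it with 1 to obtain a binary flag.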


Combination of all cleaning methods

The plots below show the data after all cleaning methods have been applied. Light colors indicate original data that were removed by the cleaning.
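
One way to combine the flag columns from the individual steps is to set a value to missing as soon as any flag equals 1. This is a sketch with made-up values; the column `IBW_TEMP_CLEAN` is a hypothetical name, and treating a missing flag as "not flagged" is an assumption.

```r
# Made-up worn temperature with the flag columns of steps 2, 3, and 4.
data_W <- data.frame(
  IBW_TEMP = c(31.2, 60.0, 30.8, 24.9, 31.5),
  IBW_TEMP_01 = c(0, 0, 0, 0, 0),
  IBW_TEMP_02 = c(0, 1, 0, 0, 0),
  IBW_TEMP_04 = c(0, 0, 0, 1, NA)
)

flag_cols <- c("IBW_TEMP_01", "IBW_TEMP_02", "IBW_TEMP_04")

# TRUE if at least one cleaning step flagged the measurement.
# Missing flags are ignored here (assumption).
flagged <- unname(rowSums(data_W[flag_cols] == 1, na.rm = TRUE) > 0)

# Cleaned variable (assumed name): flagged values become NA, the original
# column stays available for plotting in a lighter color.
data_W$IBW_TEMP_CLEAN <- ifelse(flagged, NA, data_W$IBW_TEMP)
```

Keeping the original column next to the cleaned one makes it straightforward to draw both in the same plot.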


Report on the data

Note: Noise is now averaged with respect to its inherent log scale.
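
The exact averaging formula is not given here. A common way to respect the logarithmic decibel scale is energetic averaging: the levels are converted to linear power, averaged, and converted back. The sketch below assumes this approach; `mean_db()` is a hypothetical helper.

```r
# Energetic (power) mean of decibel values, ignoring missing values.
mean_db <- function(db) {
  db <- db[!is.na(db)]
  10 * log10(mean(10^(db / 10)))
}

# Made-up noise levels in dB: two quiet readings and one loud reading.
ns <- c(50, 50, 80)

arithmetic <- mean(ns)     # treats dB as if it were a linear scale
energetic <- mean_db(ns)   # dominated by the loud reading
```

The energetic mean is always at least as large as the arithmetic mean of the decibel values, because loud periods carry far more acoustic energy than quiet ones.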

Distributions

Diurnal Cycle

Descriptive Statistics

Variable   Mean      Median    Standard Deviation
IBH_HUM    40.60369  38.68000  13.4253385
IBH_TEMP   27.22120  27.09375   2.6801571
IBW_HUM    41.59404  41.07000  12.6813064
IBW_TEMP   31.08556  31.72396   3.7832304
IBT_TEMP   34.63741  34.68750   0.7487054
NS         59.15517  55.85660  62.5825075
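
A table of this form can be produced with a few lines of base R. The sketch below uses made-up values and a hypothetical helper `summarise_var()`; it does not reproduce the numbers above.

```r
# Mean, median, and standard deviation of one variable, ignoring missing values.
summarise_var <- function(x) {
  c(
    Mean = mean(x, na.rm = TRUE),
    Median = median(x, na.rm = TRUE),
    SD = sd(x, na.rm = TRUE)
  )
}

# Made-up cleaned variables (NA marks a value removed by cleaning).
vars <- list(
  IBH_TEMP = c(26, 27, 28, NA),
  IBH_HUM = c(30, 40, 50, 60)
)

# One row per variable, one column per statistic.
stats <- t(sapply(vars, summarise_var))
```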

Scatterplots